Statistical significance of variables driving systematic variation
There are a number of well-established methods such as principal components
analysis (PCA) for automatically capturing systematic variation due to latent
variables in large-scale genomic data. PCA and related methods may directly
provide a quantitative characterization of a complex biological variable that
is otherwise difficult to precisely define or model. An unsolved problem in
this context is how to systematically identify the genomic variables that are
drivers of systematic variation captured by PCA. Principal components (and
other estimates of systematic variation) are directly constructed from the
genomic variables themselves, making measures of statistical significance
artificially inflated when using conventional methods due to over-fitting. We
introduce a new approach called the jackstraw that allows one to accurately
identify genomic variables that are statistically significantly associated with
any subset or linear combination of principal components (PCs). The proposed
method can greatly simplify complex significance testing problems encountered
in genomics and can be utilized to identify the genomic variables significantly
associated with latent variables. Using simulation, we demonstrate that our
method attains accurate measures of statistical significance over a range of
relevant scenarios. We consider yeast cell-cycle gene expression data, and show
that the proposed method can be used to straightforwardly identify
statistically significant genes that are cell-cycle regulated. We also analyze
gene expression data from post-trauma patients, allowing the gene expression
data to provide a molecularly-driven phenotype. We find a greater enrichment
for inflammatory-related gene sets compared to using a clinically defined
phenotype. The proposed method provides a useful bridge between large-scale
quantifications of systematic variation and gene-level significance analyses.
Comment: 35 pages, 1 table, 6 main figures, 7 supplementary figures
Concept Saliency Maps to Visualize Relevant Features in Deep Generative Models
Evaluating, explaining, and visualizing high-level concepts in generative
models, such as variational autoencoders (VAEs), is challenging in part due to
a lack of known prediction classes that are required to generate saliency maps
in supervised learning. While saliency maps may help identify relevant features
(e.g., pixels) in the input for classification tasks of deep neural networks,
similar frameworks are understudied in unsupervised learning. Therefore, we
introduce a new method of obtaining saliency maps for latent representations of
known or novel high-level concepts, often called concept vectors in generative
models. Concept scores, analogous to class scores in classification tasks, are
defined as dot products between concept vectors and encoded input data, which
can be readily used to compute the gradients. The resulting concept saliency
maps are shown to highlight input features deemed important for high-level
concepts. Our method is applied to the VAE latent space of the CelebA dataset, in
which known attributes such as "smiles" and "hats" are used to elucidate
relevant facial features. Furthermore, our application to spatial
transcriptomic (ST) data of a mouse olfactory bulb demonstrates the potential
of latent representations of morphological layers and molecular features in
advancing our understanding of complex biological systems. By extending the
popular method of saliency maps to generative models, the proposed concept
saliency maps help improve interpretability of latent variable models in deep
learning.
Code to reproduce and implement concept saliency maps:
https://github.com/lenbrocki/concept-saliency-maps
Comment: 18th IEEE International Conference on Machine Learning and
Applications (ICMLA
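The concept score and its gradient, as described in the abstract above, can be sketched in a few lines of Python. This is only an illustration under a strong assumption: the encoder is replaced by a hypothetical linear map z = W x, so that the gradient has the closed form W^T c. A real VAE encoder is non-linear and would need automatic differentiation. The names encode, concept_score, and concept_saliency are made up for this sketch and are not taken from the linked repository.

```python
def encode(W, x):
    """Hypothetical linear encoder z = W x (stand-in for a VAE encoder)."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]


def concept_score(W, x, concept):
    """Dot product between the concept vector and the encoded input."""
    return sum(c_i * z_i for c_i, z_i in zip(concept, encode(W, x)))


def concept_saliency(W, concept):
    """Gradient of the concept score with respect to the input x.

    For the assumed linear encoder this is W^T c and does not depend on x.
    """
    n_inputs = len(W[0])
    return [
        sum(concept[i] * W[i][j] for i in range(len(W)))
        for j in range(n_inputs)
    ]


# Toy example: 3 input features, 2 latent dimensions.
W = [[1.0, 0.0, 2.0],
     [0.0, -1.0, 1.0]]
x = [0.5, 2.0, 1.0]
concept = [1.0, 0.5]

score = concept_score(W, x, concept)
saliency = concept_saliency(W, concept)
print(score)
print(saliency)
```

In this toy setting, the input feature with the largest absolute gradient is the one that most affects the concept score.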
Jaccard/Tanimoto similarity test and estimation methods
Binary data are used in a broad area of biological sciences. Using binary
presence-absence data, we can evaluate species co-occurrences that help
elucidate relationships among organisms and environments. To summarize
similarity between occurrences of species, we routinely use the
Jaccard/Tanimoto coefficient, which is the ratio of their intersection to their
union. It is natural, then, to identify statistically significant
Jaccard/Tanimoto coefficients, which suggest non-random co-occurrences of
species. However, statistical hypothesis testing using this similarity
coefficient has seldom been used or studied.
We introduce a hypothesis test for similarity for biological presence-absence
data, using the Jaccard/Tanimoto coefficient. Several key improvements are
presented, including unbiased estimation of expectation and centered
Jaccard/Tanimoto coefficients that account for occurrence probabilities. We
derived the exact and asymptotic solutions and developed the bootstrap and
measurement concentration algorithms to compute statistical significance of
binary similarity. Comprehensive simulation studies demonstrate that our
proposed methods produce accurate p-values and false discovery rates. The
proposed estimation methods are orders of magnitude faster than the exact
solution. The proposed methods are implemented in an open source R package
called jaccard (https://cran.r-project.org/package=jaccard).
We introduce a suite of statistical methods for the Jaccard/Tanimoto
similarity coefficient that enable straightforward incorporation of
probabilistic measures in the analysis of species co-occurrences. Due to their
generality, the proposed methods and implementations are applicable to a wide
range of binary data arising from genomics, biochemistry, and other areas of
science.
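As a minimal illustration of the coefficient described above, the Python sketch below computes the Jaccard/Tanimoto coefficient of two binary presence-absence vectors and a simple permutation p-value. The function names and the permutation scheme are assumptions made for this example. They are not the exact, asymptotic, bootstrap, or measurement concentration algorithms of the jaccard R package, and the sketch does not implement the centered coefficient.

```python
import random


def jaccard(x, y):
    """Jaccard/Tanimoto coefficient: |intersection| / |union| of 0/1 vectors."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    intersection = sum(1 for a, b in zip(x, y) if a and b)
    union = sum(1 for a, b in zip(x, y) if a or b)
    # Convention assumed here: two all-zero vectors have similarity 0.
    return intersection / union if union else 0.0


def permutation_pvalue(x, y, n_perm=2000, seed=0):
    """One-sided p-value: how often a shuffled y is at least as similar to x."""
    rng = random.Random(seed)
    observed = jaccard(x, y)
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if jaccard(x, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


# Toy presence-absence data for two species across eight sites.
species_a = [1, 1, 1, 0, 0, 0, 1, 0]
species_b = [1, 1, 0, 0, 0, 0, 1, 0]

similarity = jaccard(species_a, species_b)
p_value = permutation_pvalue(species_a, species_b)
print(similarity)
print(p_value)
```

Shuffling one vector preserves each species' number of occurrences while breaking any association between the two, which is one simple way to approximate a null distribution for the coefficient.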
Integration of Radiomics and Tumor Biomarkers in Interpretable Machine Learning Models
Despite the unprecedented performance of deep neural networks (DNNs) in
computer vision, their practical application in the diagnosis and prognosis of
cancer using medical imaging has been limited. One of the critical challenges
for integrating diagnostic DNNs into radiological and oncological applications
is their lack of interpretability, preventing clinicians from understanding the
model predictions. Therefore, we study and propose the integration of
expert-derived radiomics and DNN-predicted biomarkers in interpretable
classifiers which we call ConRad, for computerized tomography (CT) scans of
lung cancer. Importantly, the tumor biomarkers are predicted from a concept
bottleneck model (CBM) such that once trained, our ConRad models do not require
labor-intensive and time-consuming biomarkers. In our evaluation and practical
application, the only input to ConRad is a segmented CT scan. The proposed
model is compared to convolutional neural networks (CNNs), which act as
black-box classifiers. We further investigated and evaluated all combinations of
radiomics, predicted biomarkers, and CNN features in five different classifiers.
We found that the ConRad models using a non-linear SVM and logistic regression
with the Lasso outperform the others in five-fold cross-validation, although we
highlight that the interpretability of ConRad is its primary advantage. The Lasso
is used for feature selection, which substantially reduces the number of
non-zero weights while increasing the accuracy. Overall, the proposed ConRad
model combines CBM-derived biomarkers and radiomics features in an
interpretable ML model that performs excellently in lung nodule malignancy
classification.
Feature Perturbation Augmentation for Reliable Evaluation of Importance Estimators
Post-hoc explanation methods attempt to make the inner workings of deep
neural networks more interpretable. However, since a ground truth is in general
lacking, local post-hoc interpretability methods, which assign importance
scores to input features, are challenging to evaluate. One of the most popular
evaluation frameworks is to perturb features deemed important by an
interpretability method and to measure the change in prediction accuracy.
Intuitively, a large decrease in prediction accuracy would indicate that the
explanation has correctly quantified the importance of features with respect to
the prediction outcome (e.g., logits). However, the change in the prediction
outcome may stem from perturbation artifacts, since perturbed samples in the
test dataset are out of distribution (OOD) compared to the training dataset and
can therefore potentially disturb the model in an unexpected manner. To
overcome this challenge, we propose feature perturbation augmentation (FPA)
which creates and adds perturbed images during the model training. Through
extensive computational experiments, we demonstrate that FPA makes deep neural
networks (DNNs) more robust against perturbations. Furthermore, training DNNs
with FPA demonstrates that the sign of importance scores may explain the model
more meaningfully than has previously been assumed. Overall, FPA is an
intuitive data augmentation technique that improves the evaluation of post-hoc
interpretability methods.
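The idea of adding perturbed samples to the training data can be sketched as follows. This is a rough illustration only: the perturbation scheme used here (replacing a random fraction of features with a constant fill value) and the names perturb and augment are assumptions made for this example, and the abstract does not specify the scheme the authors use.

```python
import random


def perturb(sample, fraction, rng, fill=0.0):
    """Replace a random fraction of the features with a fill value."""
    n_features = len(sample)
    n_perturbed = round(fraction * n_features)
    chosen = set(rng.sample(range(n_features), n_perturbed))
    return [fill if i in chosen else v for i, v in enumerate(sample)]


def augment(dataset, fraction=0.25, seed=0):
    """Return the original (features, label) pairs plus one perturbed copy of each."""
    rng = random.Random(seed)
    perturbed = [(perturb(x, fraction, rng), y) for x, y in dataset]
    return dataset + perturbed


# Toy dataset: flattened "images" with four features each.
training_data = [
    ([1.0, 2.0, 3.0, 4.0], 0),
    ([5.0, 6.0, 7.0, 8.0], 1),
]

augmented = augment(training_data, fraction=0.25, seed=0)
for features, label in augmented:
    print(features, label)
```

A model trained on the augmented set has already seen perturbed inputs, so perturbed test samples are closer to its training distribution.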
Machine Learning and Integrative Analysis of Biomedical Big Data.
Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance, and scalability issues.
Deep Learning Mental Health Dialogue System
Mental health counseling remains a major challenge in modern society due to
cost, stigma, fear, and unavailability. We posit that generative artificial
intelligence (AI) models designed for mental health counseling could help
improve outcomes by lowering barriers to access. To this end, we have developed
a deep learning (DL) dialogue system called Serena. The system consists of a
core generative model and post-processing algorithms. The core generative model
is a 2.7 billion parameter Seq2Seq Transformer fine-tuned on thousands of
transcripts of person-centered-therapy (PCT) sessions. The series of
post-processing algorithms detects contradictions, improves coherency, and
removes repetitive answers. Serena is implemented and deployed on
https://serena.chat, which currently offers limited free services. While
the dialogue system is capable of responding in a qualitatively empathetic and
engaging manner, it occasionally displays hallucination and long-term
incoherence. Overall, we demonstrate that a deep learning mental health
dialogue system has the potential to provide a low-cost and effective
complement to traditional human counselors with fewer barriers to access.
Evaluation of importance estimators in deep learning classifiers for Computed Tomography
Deep learning has shown superb performance in detecting objects and
classifying images, holding great promise for analyzing medical imaging.
Translating the success of deep learning to medical imaging, in which doctors
need to understand the underlying process, requires the capability to interpret
and explain the prediction of neural networks. Interpretability of deep neural
networks often relies on estimating the importance of input features (e.g.,
pixels) with respect to the outcome (e.g., class probability). However, a
number of importance estimators (also known as saliency maps) have been
developed and it is unclear which ones are more relevant for medical imaging
applications. In the present work, we investigated the performance of several
importance estimators in explaining the classification of computed tomography
(CT) images by a convolutional deep network, using three distinct evaluation
metrics. First, the model-centric fidelity measures a decrease in the model
accuracy when certain inputs are perturbed. Second, concordance between
importance scores and the expert-defined segmentation masks is measured on a
pixel level by receiver operating characteristic (ROC) curves. Third, we
measure a region-wise overlap between an XRAI-based map and the segmentation
mask by Dice Similarity Coefficients (DSC). Overall, two versions of SmoothGrad
topped the fidelity and ROC rankings, whereas both Integrated Gradients and
SmoothGrad excelled in DSC evaluation. Interestingly, there was a critical
discrepancy between model-centric (fidelity) and human-centric (ROC and DSC)
evaluation. Expert expectation and intuition embedded in segmentation maps do
not necessarily align with how the model arrived at its prediction.
Understanding this difference in interpretability would help harness the
power of deep learning in medicine.
Comment: 4th International Workshop on EXplainable and TRAnsparent AI and
Multi-Agent Systems (EXTRAAMAS 2022) - International Conference on Autonomous
Agents and Multi-Agent Systems (AAMAS
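The Dice Similarity Coefficient used as the third metric above can be illustrated with a short Python sketch. The fixed-threshold binarization of importance scores and the names threshold_mask and dice_coefficient are assumptions made for this example. The abstract instead derives regions from an XRAI-based map, which is not reproduced here.

```python
def dice_coefficient(mask_a, mask_b):
    """Dice Similarity Coefficient 2|A and B| / (|A| + |B|) for flat 0/1 masks."""
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have the same length")
    overlap = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    total = sum(1 for a in mask_a if a) + sum(1 for b in mask_b if b)
    # Convention assumed here: two empty masks agree perfectly.
    return 2 * overlap / total if total else 1.0


def threshold_mask(scores, threshold):
    """Binarize importance scores (illustrative stand-in for region selection)."""
    return [1 if s >= threshold else 0 for s in scores]


# Toy example: six pixels, flattened.
importance = [0.9, 0.8, 0.1, 0.7, 0.2, 0.0]
segmentation = [1, 1, 1, 0, 0, 0]

predicted = threshold_mask(importance, 0.5)
dsc = dice_coefficient(predicted, segmentation)
print(predicted)
print(dsc)
```

A DSC of 1 means the selected region coincides with the expert segmentation, and 0 means there is no overlap.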
Foundations for Open Scholarship Strategy Development, Version 2.1 [Pre-print]
This document aims to agree on a broad, international strategy for the implementation of open scholarship that meets the needs of different national and regional communities but works globally.
Scholarly research can be idealised as an inspirational process for advancing our collective knowledge to the benefit of all humankind. However, current research practices often struggle with a range of tensions, in part due to the fact that this collective (or “commons”) ideal conflicts with the competitive system in which most scholars work, and in part because much of the infrastructure of the scholarly world is becoming largely digital. What is broadly termed as Open Scholarship is an attempt to realign modern research practices with this ideal. We do not propose a definition of Open Scholarship, but recognise that it is a holistic term that encompasses many disciplines, practices, and principles, sometimes also referred to as Open Science or Open Research. We choose the term Open Scholarship to be more inclusive of these other terms. When we refer to science in this document, we do so historically and use it as shorthand for more general scholarship.
The purpose of this document is to provide a concise analysis of where the global Open Scholarship movement currently stands: what the common threads and strengths are, where the greatest opportunities and challenges lie, and how we can more effectively work together as a global community to recognise and address the top strategic priorities. This document was inspired by the Foundations for OER Strategy Development and work in the FORCE11 Scholarly Commons Working Group, and developed by an open contribution working group. Our hope is that this document will serve as a foundational resource for continuing discussions and initiatives about implementing effective strategies to help streamline the integration of Open Scholarship practices into a modern, digital research culture. Through this, we hope to extend the reach and impact of Open Scholarship into a global context, making sure that it is truly open for all. We also hope that this document will evolve as the conversations around Open Scholarship progress, and help to provide useful insight for both global co-ordination and local action. We believe this is a step forward in making Open Scholarship the norm.
Ultimately, we expect the impact of widespread adoption of Open Scholarship to be diverse. We expect novel research practices to accelerate the pace of innovation, and therefore stimulate critical industries around the world. We could also expect to see an increase in public trust of science and scholarship, as transparency becomes more normative. As such, we expect interest in Open Scholarship to increase at multiple levels, due to its inherent influence on society and global economics.
Foundations for Open Scholarship Strategy Development
This document aims to agree on a broad, international strategy for the implementation of open scholarship that meets the needs of different national and regional communities but works globally.
Scholarly research can be idealised as an inspirational process for advancing our collective knowledge to the benefit of all humankind. However, current research practices often struggle with a range of tensions, in part due to the fact that this collective (or “commons”) ideal conflicts with the competitive system in which most scholars work, and in part because much of the infrastructure of the scholarly world is becoming largely digital. What is broadly termed as Open Scholarship is an attempt to realign modern research practices with this ideal. We do not propose a definition of Open Scholarship, but recognise that it is a holistic term that encompasses many disciplines, practices, and principles, sometimes also referred to as Open Science or Open Research. We choose the term Open Scholarship to be more inclusive of these other terms. When we refer to science in this document, we do so historically and use it as shorthand for more general scholarship.
The purpose of this document is to provide a concise analysis of where the global Open Scholarship movement currently stands: what the common threads and strengths are, where the greatest opportunities and challenges lie, and how we can more effectively work together as a global community to recognise and address the top strategic priorities. This document was inspired by the Foundations for OER Strategy Development and work in the FORCE11 Scholarly Commons Working Group, and developed by an open contribution working group. Our hope is that this document will serve as a foundational resource for continuing discussions and initiatives about implementing effective strategies to help streamline the integration of Open Scholarship practices into a modern, digital research culture. Through this, we hope to extend the reach and impact of Open Scholarship into a global context, making sure that it is truly open for all. We also hope that this document will evolve as the conversations around Open Scholarship progress, and help to provide useful insight for both global co-ordination and local action. We believe this is a step forward in making Open Scholarship the norm.
Ultimately, we expect the impact of widespread adoption of Open Scholarship to be diverse. We expect novel research practices to accelerate the pace of innovation, and therefore stimulate critical industries around the world. We could also expect to see an increase in public trust of science and scholarship, as transparency becomes more normative. As such, we expect interest in Open Scholarship to increase at multiple levels, due to its inherent influence on society and global economics.